feat(gax): implement dynamic channel refreshing on 401 retries#13212
blakeli0 wants to merge 1 commit
4f508a8 to 9e55d01
Code Review
This pull request introduces a mechanism to automatically refresh transport channels when an UnauthenticatedException occurs, specifically within environments where the isMwlidEnvironment variable is set. Key changes include adding a refresh method to the TransportChannel interface and ChannelPool implementation, incorporating a 5-second debounce for refreshes, and updating the retry logic to trigger these refreshes. Review feedback highlights a potential bug in the debounce initialization, suggests using constants for magic numbers, recommends caching environment variable lookups to improve performance, and advises using imports instead of fully qualified names for better readability.
private ScheduledFuture<?> resizeFuture = null;

private final Object entryWriteLock = new Object();
private long lastRefreshTimeNanos = 0;
Initializing lastRefreshTimeNanos to 0 can lead to the first refresh being skipped if System.nanoTime() returns a value close to zero (which is possible depending on the JVM's arbitrary time origin). Additionally, the 5-second debounce interval should be defined as a constant.
- private long lastRefreshTimeNanos = 0;
+ private static final long REFRESH_DEBOUNCE_THRESHOLD_NANOS = java.util.concurrent.TimeUnit.SECONDS.toNanos(5);
+ private long lastRefreshTimeNanos = System.nanoTime() - REFRESH_DEBOUNCE_THRESHOLD_NANOS;
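The suggestion above can be tried in isolation. The following is a minimal sketch, not the PR's code: `RefreshDebouncer` and its injected clock are hypothetical names introduced here so the behaviour can be checked with a fake time source. It assumes the debounce is meant to allow the very first refresh and then suppress further refreshes for 5 seconds.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical helper for illustration only; not part of gax. */
final class RefreshDebouncer {
  private static final long REFRESH_DEBOUNCE_THRESHOLD_NANOS = TimeUnit.SECONDS.toNanos(5);

  private final LongSupplier clock; // System::nanoTime in production, a fake in tests
  private final Object lock = new Object();
  private long lastRefreshTimeNanos;

  RefreshDebouncer(LongSupplier clock) {
    this.clock = clock;
    // Backdate the timestamp by one full threshold so the first refresh is
    // always allowed, whatever the origin of the nanoTime clock is.
    this.lastRefreshTimeNanos = clock.getAsLong() - REFRESH_DEBOUNCE_THRESHOLD_NANOS;
  }

  /** Returns true if the caller may refresh now, false if it is debounced. */
  boolean tryAcquire() {
    synchronized (lock) {
      long now = clock.getAsLong();
      // Compare the difference, not the raw values: nanoTime may be negative
      // or wrap around, and only differences are meaningful.
      if (now - lastRefreshTimeNanos < REFRESH_DEBOUNCE_THRESHOLD_NANOS) {
        return false;
      }
      lastRefreshTimeNanos = now;
      return true;
    }
  }
}
```

With `new RefreshDebouncer(System::nanoTime)`, the first `tryAcquire()` succeeds even when the clock starts near zero, which is the case the comment above is worried about.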
// replaces the list)
synchronized (entryWriteLock) {
  long now = System.nanoTime();
  if (now - lastRefreshTimeNanos < TimeUnit.SECONDS.toNanos(5)) {
 */
@Override
public ApiFuture<ResponseT> submit(RetryingFuture<ResponseT> retryingFuture) {
  if ("true".equalsIgnoreCase(System.getenv("isMwlidEnvironment"))) {
if (cause instanceof com.google.api.gax.rpc.UnauthenticatedException) {
  RetryingContext context = retryingFuture.getRetryingContext();
  if (context instanceof com.google.api.gax.rpc.ApiCallContext) {
    com.google.api.gax.rpc.TransportChannel transportChannel =
        ((com.google.api.gax.rpc.ApiCallContext) context).getTransportChannel();
if ("true".equalsIgnoreCase(System.getenv("isMwlidEnvironment"))
    && previousThrowable instanceof UnauthenticatedException) {
  return true;
}
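The review summary recommends caching the environment variable lookup, since this check runs on every retry decision. One way to do that is sketched below; `MwlidEnvironment` is a hypothetical holder class, and the sketch assumes the variable does not change during the lifetime of the process.

```java
import java.util.function.Function;

/** Hypothetical holder for illustration only; not part of gax. */
final class MwlidEnvironment {
  static final String ENV_VAR = "isMwlidEnvironment";

  // Read once at class initialization; later calls reuse the cached value.
  private static final boolean ENABLED = parse(System::getenv);

  private MwlidEnvironment() {}

  /** Takes the lookup as a parameter so the parsing can be tested without a real environment. */
  static boolean parse(Function<String, String> env) {
    // equalsIgnoreCase(null) is false, so an unset variable means "disabled".
    return "true".equalsIgnoreCase(env.apply(ENV_VAR));
  }

  static boolean isEnabled() {
    return ENABLED;
  }
}
```

Call sites such as `shouldRetry` could then test `MwlidEnvironment.isEnabled()` instead of repeating the string comparison.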
9e55d01 to 188158f
  lastAttemptResult.get();
} catch (java.util.concurrent.ExecutionException e) {
  Throwable cause = e.getCause();
  if (cause instanceof com.google.api.gax.rpc.UnauthenticatedException) {
Referencing com.google.api.gax.rpc classes (e.g. ApiCallContext, TransportChannel) inside ScheduledRetryingExecutor creates a circular dependency between the com.google.api.gax.rpc and com.google.api.gax.retrying packages.
We can define an MtlsRotationHandler within gax.retrying that is retrieved from the RetryingContext, so that the executor only interacts with the MtlsRotationHandler.
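A rough sketch of that shape follows. Everything here is an assumption made for illustration: the `MtlsRotationHandler` methods, the `getMtlsRotationHandler()` accessor, and the stand-in types are hypothetical, and the real RetryingContext has no such accessor today. The point is only that the retrying side sees an interface it owns, while the rpc side supplies the implementation.

```java
import java.util.function.Predicate;

/** Hypothetical interface that would live in gax.retrying. */
interface MtlsRotationHandler {
  void onAttemptFailed(Throwable cause);
}

/** Simplified stand-in for RetryingContext with a hypothetical accessor. */
interface RetryingContextSketch {
  MtlsRotationHandler getMtlsRotationHandler();
}

/** Stand-in for the executor: it depends only on retrying-side types. */
final class ExecutorSketch {
  static void handleFailure(RetryingContextSketch context, Throwable cause) {
    MtlsRotationHandler handler = context.getMtlsRotationHandler();
    if (handler != null) {
      handler.onAttemptFailed(cause);
    }
  }
}

/** Would live on the rpc side, where the exception and channel types are known. */
final class ChannelRefreshingHandler implements MtlsRotationHandler {
  private final Predicate<Throwable> shouldRefresh; // e.g. "is UnauthenticatedException"
  private final Runnable refreshChannel; // e.g. a call that refreshes the channel

  ChannelRefreshingHandler(Predicate<Throwable> shouldRefresh, Runnable refreshChannel) {
    this.shouldRefresh = shouldRefresh;
    this.refreshChannel = refreshChannel;
  }

  @Override
  public void onAttemptFailed(Throwable cause) {
    if (shouldRefresh.test(cause)) {
      refreshChannel.run();
    }
  }
}
```

With this split, `ExecutorSketch` never names a type from the rpc package, so the dependency runs in one direction only.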
vverman left a comment
In general I think this design is elegant and handles the unary approach well. There might be changes needed to accommodate cert mismatches and avoid circular dependencies.
The only open concern is streaming requests, for which we cannot leverage the RetryExecutor. Since we don't want to retry a failed stream request and instead just refresh the channel, IIUC, we would be restricted to using a per-call interceptor.
public boolean shouldRetry(
    RetryingContext context, Throwable previousThrowable, ResponseT previousResponse) {
  if ("true".equalsIgnoreCase(System.getenv("isMwlidEnvironment"))
      && previousThrowable instanceof UnauthenticatedException) {
nit: This checks only for UNAUTHENTICATED failures. Instead, we should check for a cert mismatch, which can be done by passing a parameter with the context. However, it is key that we avoid redundant disk reads, since many requests could fail simultaneously.
// replaces the list)
synchronized (entryWriteLock) {
  long now = System.nanoTime();
  if (now - lastRefreshTimeNanos < TimeUnit.SECONDS.toNanos(5)) {
nit: One potential concern here is that if the certs rotate within the 5-second debounce window, a request can fail due to a cert mismatch without triggering a refresh. This could lead to valid requests failing.
I think we can build a workaround by using the cert-mismatch as a trigger.
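One possible reading of that workaround is sketched below, under stated assumptions: each channel remembers a fingerprint of the certificate it was built with, and a failed request reports that fingerprint. `CertMismatchRefresher`, the fingerprint strings, and the supplier that reads the current certificate are all hypothetical names, not gax APIs.

```java
import java.util.function.Supplier;

/** Hypothetical sketch; refreshes on cert mismatch instead of on a timer. */
final class CertMismatchRefresher {
  private final Supplier<String> currentFingerprint; // assumed to read the cert from disk
  private final Runnable refreshChannel;
  private final Object lock = new Object();
  private String activeFingerprint;

  CertMismatchRefresher(Supplier<String> currentFingerprint, Runnable refreshChannel) {
    this.currentFingerprint = currentFingerprint;
    this.refreshChannel = refreshChannel;
    this.activeFingerprint = currentFingerprint.get();
  }

  /**
   * @param failedFingerprint fingerprint of the cert used by the failed request's channel
   * @return true if this call refreshed the channel
   */
  boolean refreshIfStale(String failedFingerprint) {
    synchronized (lock) {
      if (!failedFingerprint.equals(activeFingerprint)) {
        // Another caller already refreshed past this cert: skip the disk read.
        return false;
      }
      String onDisk = currentFingerprint.get();
      if (onDisk.equals(activeFingerprint)) {
        // No rotation happened; the failure has some other cause.
        return false;
      }
      refreshChannel.run();
      activeFingerprint = onDisk;
      return true;
    }
  }
}
```

When many requests fail together after a rotation, only the first performs the disk read and the refresh; the rest return at the first check. Failures that are not caused by a rotation still cost one read each in this sketch, so a time-based debounce might still be wanted alongside it.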
This PR implements dynamic channel refreshing on 401 Unauthenticated retries under the isMwlidEnvironment environment variable. It introduces compile-time type-safe refresh contracts across TransportChannel and ApiCallContext, with debouncing protection in ChannelPool to prevent connection stampedes.